Bilingual Text Matching using Bilingual Dictionary and Statistics

Authors

  • Takehito Utsuro
  • Hiroshi Ikeda
  • Masaya Yamane
  • Yuji Matsumoto
  • Makoto Nagao
Abstract

This paper describes a unified framework for bilingual text matching by combining existing hand-written bilingual dictionaries and statistical techniques. The process of bilingual text matching consists of two major steps: sentence alignment and structural matching of bilingual sentences. Statistical techniques are applied to estimate word correspondences not included in bilingual dictionaries. Estimated word correspondences are useful for improving both sentence alignment and structural matching.

1 Introduction

Bilingual (or parallel) texts are useful as resources of linguistic knowledge as well as in applications such as machine translation. One of the major approaches to analyzing bilingual texts is the statistical approach. The statistical approach involves the following: alignment of bilingual texts at the sentence level using statistical techniques (e.g. Brown, Lai and Mercer (1991), Gale and Church (1993), Chen (1993), and Kay and Röscheisen (1993)), statistical machine translation models (e.g. Brown, Cocke, Pietra, Pietra et al. (1990)), finding character-level / word-level / phrase-level correspondences from bilingual texts (e.g. Gale and Church (1991), Church (1993), and Kupiec (1993)), and word sense disambiguation for MT (e.g. Dagan, Itai and Schwall (1991)). In general, the statistical approach does not use existing hand-written bilingual dictionaries, and depends solely upon statistics. For example, sentence alignment of bilingual texts is performed just by measuring sentence lengths in words or in characters (Brown et al., 1991; Gale and Church, 1993), or by statistically estimating word-level correspondences (Chen, 1993; Kay and Röscheisen, 1993). The statistical approach analyzes unstructured sentences in bilingual texts, and it is claimed that the results are useful enough in real applications such as machine translation and word sense disambiguation.
However, structured bilingual sentences are undoubtedly more informative and important for future natural language research. Structured bilingual or multilingual corpora serve as richer sources for extracting linguistic knowledge (Klavans and Tzoukermann, 1990; Sadler and Vendelmans, 1990; Kaji, Kida and Morimoto, 1992; Utsuro, Matsumoto and Nagao, 1992; Matsumoto, Ishimoto and Utsuro, 1993; Utsuro, Matsumoto and Nagao, 1993). Compared with the statistical approach, those works are quite different in that they use word correspondence information available in hand-written bilingual dictionaries and try to extract structured linguistic knowledge such as structured translation patterns and case frames of verbs. For example, in Matsumoto et al. (1993), we proposed a method for finding structural matching of parallel sentences, making use of word-level similarities calculated from a bilingual dictionary and a thesaurus. Those structurally matched parallel sentences are then used as a source for acquiring lexical knowledge such as verbal case frames (Utsuro et al., 1992; Utsuro et al., 1993). With the aim of acquiring such structured linguistic knowledge, this paper describes a unified framework for bilingual text matching by combining existing hand-written bilingual dictionaries and statistical techniques. The process of bilingual text matching consists of two major steps: sentence alignment and structural matching of bilingual sentences. In those two steps, we use word correspondence information that is either available in hand-written bilingual dictionaries or, when not included in those dictionaries, estimated with statistical techniques.
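As a rough illustration of how word correspondences missing from a bilingual dictionary might be estimated from sentence-aligned text, the sketch below scores candidate word pairs with the Dice coefficient over co-occurrence counts. The Dice coefficient is an assumption chosen here for simplicity; the text above does not say which statistic the authors use. The function name `dice_correspondences`, its parameters `min_count` and `threshold`, and the toy romanized sentences are all hypothetical.

```python
from collections import Counter
from itertools import product


def dice_correspondences(aligned_pairs, dictionary, min_count=2, threshold=0.5):
    """Estimate word correspondences that are not in `dictionary`.

    aligned_pairs: list of (source_tokens, target_tokens), one per aligned
        sentence pair (assumed to be already aligned).
    dictionary: set of (source_word, target_word) pairs already known.
    Returns {(source_word, target_word): dice_score} for pairs whose
    co-occurrence count is >= min_count and whose score is >= threshold.
    NOTE: the Dice coefficient is an illustrative assumption, not the
    statistic confirmed by the paper.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned_pairs:
        # Count each word at most once per sentence pair.
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        pair_freq.update(product(src_set, tgt_set))

    estimated = {}
    for (s, t), both in pair_freq.items():
        # Skip pairs the dictionary already covers and rare co-occurrences.
        if (s, t) in dictionary or both < min_count:
            continue
        score = 2.0 * both / (src_freq[s] + tgt_freq[t])
        if score >= threshold:
            estimated[(s, t)] = score
    return estimated


# Toy aligned corpus (invented data, for illustration only).
pairs = [
    (["kare", "wa", "hon", "o", "yomu"], ["he", "reads", "a", "book"]),
    (["kanojo", "wa", "hon", "o", "kau"], ["she", "buys", "a", "book"]),
    (["kare", "wa", "kuruma", "o", "kau"], ["he", "buys", "a", "car"]),
]
known = {("kare", "he"), ("kanojo", "she")}
result = dice_correspondences(pairs, known)
```

On a corpus this small, frequent function words also pair up with many target words, which is consistent with the point made in the text that statistics alone are unreliable for short texts and low-frequency words, and that dictionary information is needed alongside them.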
We combine bilingual dictionaries and statistics for the following reasons. Statistical techniques are limited since 1) they require bilingual texts to be long enough for extracting useful statistics, while we need to acquire structured linguistic knowledge even from bilingual texts of about 100 sentences, and 2) even with bilingual texts long enough for statistical techniques, useful statistics cannot be extracted for low-frequency words. For reasons 1) and 2), the use of bilingual dictionaries is indispensable in our application. On the other hand, existing hand-written bilingual dictionaries are limited in that the available dictionaries cover only everyday words, and domain-specific on-line bilingual dictionaries are usually not available. Thus, statistical techniques are also indispensable for extracting domain-specific word correspondence information not included in existing bilingual dictionaries. At present, we are at the starting point of combining existing bilingual dictionaries and statistical techniques. Therefore, as statistical techniques for estimating word correspondences not included in bilingual dictionaries, we decided to adopt techniques as simple as possible, rather than techniques based on complex probabilistic translation models such as in


Similar resources

EFL Translation Students' Perspective toward Using Bilingual Dictionary in Translation of Polysemous Words

This research presented the use of bilingual dictionary and addressed the EFL translation students' points of view on the use of bilingual dictionary in translating polysemous words (English to Persian). Moreover, it aimed at finding the possible relationship between the effect of using bilingual dictionary by students in translating polysemous words and their achieved scores. In the study ...


An Investigation into Bilingual Dictionary Use: Do the Frequency of Use and Type of Dictionary Make a Difference in L2 Writing Performance?

Bilingual dictionary use in L2 writing test performance has recently been the subject of debate. Opinions differ according to how the trait is understood and whether the system favors the process-oriented or product-oriented views towards the assessment and writing skill. Given the need for more empirical support, this study is aimed at investigating the availability of bilingual dictionary use...


Dictionary acquisition using parallel text and co-occurrence statistics

We present a simple and efficient approach for deriving bilingual dictionaries from sentence-aligned parallel text by extending the notion of co-occurrences to a cross-lingual setting. Dictionaries are evaluated against gold standards and manually; the analysis accounts for frequency and corpus size effects.


Translation By Machine Of Complex Nominals: Getting It Right

We present a method for compositionally translating noun-noun (NN) compounds, using a word-level bilingual dictionary and syntactic templates for candidate generation, and corpus and dictionary statistics for selection. We propose a support vector learning-based method employing target language corpus and bilingual dictionary data, and evaluate it over an English-Japanese machine translation tas...


Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus

We propose a novel context heterogeneity similarity measure between words and their translations in helping to compile bilingual lexicon entries from a non-parallel English-Chinese corpus. Current algorithms for bilingual lexicon compilation rely on occurrence frequencies, length or positional statistics derived from parallel texts. There is little correlation between such statistics of a word ...




Publication date: 1994